Conversation

@badgersec

URL normalization for seeds

The crawler failed on URLs with query parameters: state.ts sorted query parameters alphabetically when queueing, but seeds.ts didn't apply the same normalization before matching, causing "Page no longer in scope" errors.

 docker run -it --rm -v $PWD/crawls:/crawls/ \
  webrecorder/browsertrix-crawler:latest crawl \
  --url https://www.facebook.com/permalink.php\?story_fbid\=pfbid0BqNZHQaQfqTAKzVaaeeYNuyPXFJhkPmzwWT7mZPZJLFnHNEvsdbnLJRPkHJDMcqFl\&id\=100082135548177 \
  --scopeType page

results in "Page no longer in scope":

{"timestamp":"2026-01-27T06:41:47.635Z","logLevel":"info","context":"general","message":"Page no longer in scope","details":{"url":"https://www.facebook.com/permalink.php?id=100082135548177&story_fbid=pfbid0BqNZHQaQfqTAKzVaaeeYNuyPXFJhkPmzwWT7mZPZJLFnHNEvsdbnLJRPkHJDMcqFl","seedId":0,"depth":0,"extraHops":0,"retry":0,"status":0,"pageid":"ef2dcbcc-775c-41f6-8477-117377b88b5a","callbacks":{},"isHTMLPage":true,"skipBehaviors":false,"pageSkipped":false,"noRetries":false,"asyncLoading":false,"filteredFrames":[],"loadState":0,"contentCheckAllowed":false,"logDetails":{}}}

whereas the previous version, 1.9.2, crawled it successfully.

Fix for array-valued headers:

Running a crawl with --extraHops results in "not a valid multi value header":

docker run -it --rm -v $PWD/crawls:/crawls/ \
  webrecorder/browsertrix-crawler:latest crawl \
  --url https://www.facebook.com/permalink.php\?story_fbid\=pfbid0BqNZHQaQfqTAKzVaaeeYNuyPXFJhkPmzwWT7mZPZJLFnHNEvsdbnLJRPkHJDMcqFl\&id\=100082135548177 \
  --scopeType page --extraHops 1
...
{"timestamp":"2026-01-27T07:09:17.486Z","logLevel":"warn","context":"fetch","message":"Async load headers failed","details":{"type":"exception","message":"not a valid multi value header","stack":"Error: not a valid multi value header\n    at Pe (file:///app/node_modules/warcio/dist/index.js:1:2889)\n    at AsyncFetcher.loadHeadersFetch (file:///app/dist/util/recorder.js:1254:38)\n    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at async AsyncFetcher.loadHeaders (file:///app/dist/util/recorder.js:1182:27)\n    at async AsyncFetcher.load (file:///app/dist/util/recorder.js:1156:19)\n    at async file:///app/node_modules/p-queue/dist/index.js:118:36","page":"https://www.facebook.com/permalink.php?id=100082135548177&story_fbid=pfbid0BqNZHQaQfqTAKzVaaeeYNuyPXFJhkPmzwWT7mZPZJLFnHNEvsdbnLJRPkHJDMcqFl","workerid":0}}
...

This happened because recorder.ts called multiValueHeader() on every array-valued header, but warcio.js 2.4+ only allows set-cookie, warc-concurrent-to, and warc-protocol as multi-value headers, so Facebook and other sites that return other array-valued headers triggered the "not a valid multi value header" error.

@Mr0grog

Mr0grog commented Jan 28, 2026

I ran into this issue last night while trying to archive some papers and studies hosted by EPA that we believe are at risk of deletion (e.g. https://ordspub.epa.gov/ords/eims/eimscomm.getfile?p_download_id=552173) and which contain duplicate Strict-Transport-Security and X-Content-Type-Options headers.

This solution seems like it would work 🎉, BUT I kinda feel like this is an upstream issue in Warcio… concatenating with a comma is not quite faithful to the original response, and it seems like it would be better if Warcio was only choosy about which multi-valued WARC record headers were allowed, and not about the HTTP headers that are in the WARC record’s payload. It seems like the same test and multi-value function/method is being used for both cases, and that’s not ideal.

@Mr0grog

Mr0grog commented Jan 28, 2026

Nevermind, I spoke too soon! Looks like it did get fixed upstream in 2.4.8: webrecorder/warcio.js#94

@ikreymer
Member

@badgersec thanks for the fix, I think the normalization definitely makes sense!

We are in the process of fixing the header issue as well, @Mr0grog: there's actually a follow-up PR, webrecorder/warcio.js#95, which we plan to merge along with #952 so that multi value headers are still checked and certain HTTP headers are still disallowed. Let us know if you have any feedback on that!

@socket-security

Review the following changes in direct dependencies:

Updated: warcio 2.4.7 ⏵ 2.4.9

@badgersec
Author

Rebased on #952 -- test results inline.

command:

docker run -it --rm -v $PWD/crawls:/crawls/ \
  browsertrix-crawler:final-fix crawl \
  --url https://www.facebook.com/permalink.php\?story_fbid\=pfbid0BqNZHQaQfqTAKzVaaeeYNuyPXFJhkPmzwWT7mZPZJLFnHNEvsdbnLJRPkHJDMcqFl\&id\=100082135548177 \
  --scopeType page

result:

{"timestamp":"2026-01-29T01:54:22.993Z","logLevel":"info","context":"general","message":"Browsertrix-Crawler 1.11.1 (with warcio.js 2.4.7)","details":{}}
{"timestamp":"2026-01-29T01:54:22.995Z","logLevel":"info","context":"general","message":"Seeds","details":[{"url":"https://www.facebook.com/permalink.php?id=100082135548177&story_fbid=pfbid0BqNZHQaQfqTAKzVaaeeYNuyPXFJhkPmzwWT7mZPZJLFnHNEvsdbnLJRPkHJDMcqFl","scopeType":"page","include":[],"exclude":[],"allowHash":false,"depth":-1,"sitemap":null,"auth":null,"_authEncoded":null,"maxExtraHops":0,"maxDepth":0}]}
{"timestamp":"2026-01-29T01:54:22.995Z","logLevel":"info","context":"general","message":"Link Selectors","details":[{"selector":"a[href]","extract":"href","isAttribute":false}]}
{"timestamp":"2026-01-29T01:54:22.995Z","logLevel":"info","context":"general","message":"Behavior Options","details":{"message":"{\"autoplay\":true,\"autofetch\":true,\"autoscroll\":true,\"siteSpecific\":true,\"log\":\"__bx_log\",\"startEarly\":true,\"clickSelector\":\"a\"}"}}
{"timestamp":"2026-01-29T01:54:23.709Z","logLevel":"info","context":"worker","message":"Creating 1 workers","details":{}}
{"timestamp":"2026-01-29T01:54:23.710Z","logLevel":"info","context":"worker","message":"Worker starting","details":{"workerid":0}}
{"timestamp":"2026-01-29T01:54:23.819Z","logLevel":"info","context":"worker","message":"Starting page","details":{"workerid":0,"page":"https://www.facebook.com/permalink.php?id=100082135548177&story_fbid=pfbid0BqNZHQaQfqTAKzVaaeeYNuyPXFJhkPmzwWT7mZPZJLFnHNEvsdbnLJRPkHJDMcqFl"}}
{"timestamp":"2026-01-29T01:54:23.821Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":0,"total":1,"pending":1,"failed":0,"limit":{"max":0,"hit":false},"pendingPages":["{\"seedId\":0,\"started\":\"2026-01-29T01:54:23.712Z\",\"extraHops\":0,\"url\":\"https:\\/\\/www.facebook.com\\/permalink.php?id=100082135548177&story_fbid=pfbid0BqNZHQaQfqTAKzVaaeeYNuyPXFJhkPmzwWT7mZPZJLFnHNEvsdbnLJRPkHJDMcqFl\",\"added\":\"2026-01-29T01:54:23.102Z\",\"depth\":0}"]}}
{"timestamp":"2026-01-29T01:54:25.287Z","logLevel":"info","context":"general","message":"Awaiting page load","details":{"page":"https://www.facebook.com/permalink.php?id=100082135548177&story_fbid=pfbid0BqNZHQaQfqTAKzVaaeeYNuyPXFJhkPmzwWT7mZPZJLFnHNEvsdbnLJRPkHJDMcqFl","workerid":0}}
{"timestamp":"2026-01-29T01:54:30.491Z","logLevel":"info","context":"general","message":"Seed page redirected, adding redirected seed","details":{"origUrl":"https://www.facebook.com/permalink.php?id=100082135548177&story_fbid=pfbid0BqNZHQaQfqTAKzVaaeeYNuyPXFJhkPmzwWT7mZPZJLFnHNEvsdbnLJRPkHJDMcqFl","newUrl":"https://www.facebook.com/login/?next=https%3A%2F%2Fwww.facebook.com%2Fpermalink.php%3Fid%3D100082135548177%26story_fbid%3Dpfbid0BqNZHQaQfqTAKzVaaeeYNuyPXFJhkPmzwWT7mZPZJLFnHNEvsdbnLJRPkHJDMcqFl","seedId":1}}
{"timestamp":"2026-01-29T01:54:34.442Z","logLevel":"info","context":"pageStatus","message":"Page Finished","details":{"loadState":4,"page":"https://www.facebook.com/permalink.php?id=100082135548177&story_fbid=pfbid0BqNZHQaQfqTAKzVaaeeYNuyPXFJhkPmzwWT7mZPZJLFnHNEvsdbnLJRPkHJDMcqFl","workerid":0}}
{"timestamp":"2026-01-29T01:54:34.451Z","logLevel":"info","context":"worker","message":"Worker done, all tasks complete","details":{"workerid":0}}
{"timestamp":"2026-01-29T01:54:34.540Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":1,"total":1,"pending":0,"failed":0,"limit":{"max":0,"hit":false},"pendingPages":[]}}
{"timestamp":"2026-01-29T01:54:34.541Z","logLevel":"info","context":"general","message":"Crawling done","details":{}}
{"timestamp":"2026-01-29T01:54:34.542Z","logLevel":"info","context":"general","message":"Exiting, Crawl status: done","details":{}}

@badgersec badgersec changed the base branch from main to add-click-links January 29, 2026 02:05
